Mining Co-Occurrence Matrices for SO-PMI Paradigm Word Candidates
Author
Abstract
This paper focuses on one aspect of SO-PMI, an unsupervised approach to sentiment vocabulary acquisition proposed by Turney (Turney and Littman, 2003). The method, originally applied and evaluated for English, is often used to bootstrap sentiment lexicons for European languages, for which such resources typically do not exist. In general, SO-PMI values are computed from word co-occurrence frequencies in the neighbourhoods of two small sets of paradigm words, one positive and one negative. The goal of this work is to investigate how lexeme selection affects the quality of the resulting sentiment estimates. To this end, ad hoc random lexeme selection is compared with two alternative heuristics, based on clustering and on singular value decomposition (SVD) of a word co-occurrence matrix; both heuristics outperform random selection. The work can also be interpreted as a sensitivity analysis of SO-PMI with regard to paradigm word selection. The experiments were carried out for Polish.
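As a rough illustration of how an SO-PMI score depends on the chosen paradigm words, the sketch below uses the standard definition: the sum of PMI values with the positive paradigm words minus the sum of PMI values with the negative ones. It is a minimal sketch, not the paper's implementation. The function name `so_pmi`, the additive `smoothing` constant, the pair-keyed `cooc` dictionary, and all counts are hypothetical and chosen only for the example.

```python
import math

def so_pmi(word, pos_paradigm, neg_paradigm, cooc, freq, total, smoothing=0.01):
    """Sum of PMI(word, p) over positive paradigm words minus the
    sum of PMI(word, n) over negative paradigm words.

    cooc  : dict mapping frozenset({a, b}) to a co-occurrence count
    freq  : dict mapping each word to its corpus frequency
    total : number of observation windows (assumed normaliser)
    """
    def pmi(a, b):
        # Smoothing (an assumption here) keeps log2 defined for unseen pairs.
        joint = cooc.get(frozenset((a, b)), 0) + smoothing
        return math.log2(joint * total / (freq[a] * freq[b]))

    positive = sum(pmi(word, p) for p in pos_paradigm)
    negative = sum(pmi(word, n) for n in neg_paradigm)
    return positive - negative

# Hypothetical toy counts, invented purely for illustration.
freq = {"wspaniały": 10, "dobry": 20, "zły": 20}
cooc = {
    frozenset(("wspaniały", "dobry")): 8,
    frozenset(("wspaniały", "zły")): 1,
}
total = 1000

score = so_pmi("wspaniały", ["dobry"], ["zły"], cooc, freq, total)
print(score > 0)  # the candidate co-occurs more often with the positive word
```

Swapping the two paradigm sets flips the sign of the score, which is why the choice of paradigm words directly shapes the resulting estimates.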
Similar resources
Improving Pointwise Mutual Information (PMI) by Incorporating Significant Co-occurrence
We design a new co-occurrence based word association measure by incorporating the concept of significant co-occurrence in the popular word association measure Pointwise Mutual Information (PMI). By extensive experiments with a large number of publicly available datasets we show that the newly introduced measure performs better than other co-occurrence based measures and despite being resource-li...
Co-Occurrence-Based Error Correction Approach to Word Segmentation
To overcome the problems in Thai word segmentation, a number of word segmentation approaches have been proposed over the years. We propose a novel Thai word segmentation approach called Co-occurrence-Based Error Correction (CBEC). CBEC generates all possible segmentation candidates using the classical maximal matching algorithm and then selects the most accurate segmentation ...
2018 Formatting Instructions for Authors Using LaTeX
Word embedding models such as GloVe rely on co-occurrence statistics from a large corpus to learn vector representations of word meaning. These vectors have proven to capture surprisingly fine-grained semantic and syntactic information. While we may similarly expect that co-occurrence statistics can be used to capture rich information about the relationships between different words, existing app...
Using Filtered Second Order Co-occurrence Matrix to Improve the Traditional Co-occurrence Model
Using co-occurrence statistics to measure word similarity or relatedness has applications in many areas of natural language processing. Our experimental results also indicate that two words with zero co-occurrence statistics could still be related. In this paper, we present two algorithms, both of which were evaluated on 80 synonym test questions from the Test of English as a Foreign Language (TOE...
Lexical Co-occurrence, Statistical Significance, and Word Association
Lexical co-occurrence is an important cue for detecting word associations. We present a theoretical framework for discovering statistically significant lexical co-occurrences from a given corpus. In contrast with the prevalent practice of giving weight to unigram frequencies, we focus only on the documents containing both the terms (of a candidate bigram). We detect biases in span distributi...